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Abstract 

Tb b^apn* s4ehwe 4^&pmputational problem of Inverse km«n*tici and inverse dynamics of robot manipulators by 
taking advantage of parallelism and pipelining architectures^ For the computation of inverse kinematic position solu- 
tion, a maximum pipelined CORD1C architecture has been designed based on a functional decomposition of the closed- 
form joint equations* For the inverse dynamics computation, an efficient p-fold parallel algorithm to overcome the 
recurrence problem of the Newton-Euler equations of motion to achieve the time lower bound of Offlog-n]) has also 
been developed. 


1* Introduction 

Robot manipulators are highly nonlinear systems, and their motion control is usually specified in terms of the path 
traveled by the manipulator hand in Cartesian coordinates. To perform a simple kinematic path control, the controller 
is required to compute accurately the joint angles of the manipulator along the desired Cartesian path at an adequate 
and acceptable rate. To perform a dynamic path-tracking control, one must repeatedly compute the required general- 
ized forces, from an appropriate manipulator dynamics model, using the measured data of displacements and velocities 
of all the joints, and the accelerations computed from some justifiable formulae or approximations, to drive all the joint 
motors. In order to achieve fast convergence of the control algorithm, a sampling rate of no less than 60 Hz is prefer- 
able because the mechanical resonant frequency of most industrial manipulators is around 5—10 Hz. The above 
kinematic and dynamic path control reveals a basic characteristic and common problem in robotic manipulator control 
— intensive computations with a high level of data dependency. Despite their impressive speed, conventional general- 
purpose uniprocessor computers can not efficiently handle the kinematic and dynamic path control computations at the 
required computation rate because their architectures limit them to a mostly serial approach to computation, and there- 
fore limit their usefulness for robotic computational problems. This paper addresses these intensive robotic computa- 
tional problems by taking advantage of parallelism and pipelining architectures. 

Considering that most industrial robots have simple geometry, the kinematic path control requires the computa- 
tion of the solution of joint angles which can be obtained by various techniques. The inverse transform technique (l) 
yields a set of explicit, closed-form, non-iterative joint angle equations which involve multiplications, additions, square 
root, and transcendental function operations. Based on an actual implementation on a multiprocessor systemf [2,3) hav- 
ing a circuit to synchronize the CPUs and software scheduling for computing the joint solution, the best reported com- 
putation time was 3.6 ms for a six-link manipulator versus 20 ms running on a uniprocessor system. If we use a 
CORDIC (Coordinate Rotation Digital Computer) architecture [4], the computation time reduces to 40 ^z, a speed-up 
factor of 500tt* 

For the dynamic path-tracking control, there are a number of ways to compute the generalised forces/torques 
applied to the joint motors [5], among which the computation of joint torques from the Newton-Euler (NE) equations of 
motion is the most efficient and has been shown to possess the time lower bound of 0(n) running in uniprocessor com- 
puters [6,7], where n is the number of degrees of freedom (DOF) of the manipulator. Based on the study of Luh, Walker, 
and Paul [7j, it requires (150n - 48) multiplications and (131n — 48) additions per trajectory set point for a manipula- 
tor with rotary joints. It is unlikely that further substantial improvements in computational efficiency can be achieved, 
since the recursive NE equations are efficiently computing the minimum information needed to compute the generalized 
forces/torques: angular velocity, finer and angular acceleration, and joint forces and torques. For a Stanford robot arm 
(a total of 308 multiplications and 254 additions is required to compute the joint torques [8j), this amounts to 25 ms 
processing time on a uniprocessor system and 5.69 ms running on an experimental multiprocessor system with 7 proces- 
sors [9j. If we use the parallel algorithm with 6 processors as proposed in this paper, this reduces the computation from 
852 multiplications and 738 additions running on a uniprocessor to 197 multiplications and 183 additions for a PUMA 


This work WU supported in part by the National Science Foundation Engineering Research Center Grant CDRSS00022. Any opinions, findings, and 
conclusions or recommendations expressed in this articie are those of the author and do not necessarily reflect the views of the fuadiag agency, 
t The ms It i processor system consists of a M CSS 09 CPU and seven 2*0 CPUs. Each ZM is accompanied by two 9511 APUs, local memory, and 2/0 
interfaces. 

ft A speed-up factor is defined as the ratio of the compstational time of a task rosning on a uniprocessor system to the computational time of the 
same task running on the proposed architecture (i.e., 20 ms/ 40/1 s = 500). 
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robot (due to different kinematic structure of PUMA and Stanford robots, direct comparison on processing time is 
invalid) [10]. 

This paper discusses the development of a maximum pipelined CORDIC architecture for the computation of 
inverse kinematic position solution to achieve the pipelined time of 40 and an efficient p-fold parallel algorithm to 
achieve the time lower bound of computing the joint torques. The CORDIC architecture was designed based on a func- 
tional decomposition of the closed-form joint equations. Delay buffers are necessary to balance the pipelined CORDIC 
architecture vo achieve maximum pipelining. The buffer assignment problem is solved by the integer linear program- 
ming technique. The efficient p-fold parallel algorithm can be best described as consisting of p -parallel blocks with 
pipelined elements within each parallel block to achieve the time lower bound of O^log**]) of computing the joint 
torques based on the Newton-Euler equations of motion, where n is ihe number of degrees of freedom of the manipula- 
tor. The algorithm can be implemented with a group of microprocessors without complex intercommunication among 
processors and bussing of data. A modified inverse shuffle scheme is suggested for connecting the processors together 
with efficient intercommunications. 


2. Inverse Kinematic Position Computation 

The general kinematic problem of a ft- DOF robot arm concerns the problem of finding the generalised coordi- 
nates q = together with the vector of their generalised velocities and the vector of their generalised 

accelerations in the n- dimensional space such that the characteristics of the motion of the free end, the hand, coincide 
with the pre-$pecified Cartesian trajectory. This inverse problem has earned considerable attention because of its 
importance in relating the Cartesian trajectory of the hand to the corresponding joint-variable trajectory of the mani- 
pulator. This paper focuses only on the inverse kinematic position solution. 

In solving the inverse kinematic position problem, we are always interested in obtaining a closed-form solution (ue. 
an algebraic equation relating the given manipulator hand position and orientation to one of the unknown joint dis- 
placements), which yields all the possible solutions in a fixed computation time. Fortunately, most industrial robots 
have simple geometry and exhibit closed- form joint solution. Utilising the inverse transform technique [l], the joiat 
angle equations of a six-link manipulator with simple geometry reveal the computation of a large set of elementary 
operations: real number multiplications, additions, divisions, square roots, trigonometric functions and their inverse. 
However, these elementary operations, in general, cannot be efficiently computed in general-purpose uniprocessor com- 
puters. In order to obtain a fixed computation time for the joint angle solution, time-consuming transcendental func- 
tions (sine, cosine, and arc tangent) are implemented as table look-up at the expense of the solution accuracy. The 
CORDIC algorithms [11-14] are the natural candidates for efficiently computing these elementary operations. They 
represent an efficient way to compute a variety of functions related to coordinate transformations with iterative pro- 
cedures involving only shift-and-add operations at each step. Thus, cordic processing elements are extremely simple and 
quite compact to realise [14 J and the interconnection of CORDIC processors to exploit the great potential of pipelining 
and multiprocessing provides a cost-effective solution for computing the Inverse kinematic position solution. 


2.1. CORDIC Algorithms and Processors 

In conventional uniprocessor computers, computation of elementary functions such as square roots, sine, t cosine, 
hyperbolic sine and cosine and their inverse consumes a considerable amount of effort than multiplication operation. 
These elementary functions can be efficiently computed by the cordic algorithms which can be described by a single set 
of iterative equations parametrized by a quantity m ( = —1,0,1) which determines the type of rotations. To establish 
connections between CORDIC and rotation-based algorithms, let the angle of rotation 0 be decomposed into a sum of a 
sub-angles {d,; i = 0, n — 1} 

* = E (i) 

xM) 

where the sign u, (±1) is chosen based on the direction of rotation. Similarly, the plane rotation matrix R((?) 
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can also be decomposed into a product of sub-angle rotation matrices 

m = nRK) 
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Thus, a single rotation of Q angle can be replaced by n smaller rotations with d,- angle each. In the cordic algorithms, 
d, is chosen such that 
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where m determines the type of rotations and {«(»); » * 0, n— 1} is a non-decreasing integer sequence. Using df from 
(4), R(d,) can be written as 




1 

1 


(S) 


where p, is a scaling factor and equals to (1 + n2‘ Sl l , *)'\ Let R*(f) and R w (d,) be the normalised form of 11(1) and 
R(d,) t respectively, then from (3), we have 

rw - n n ill »*«) - (••*) 

■ ■4 i* 4 ) #*0 


where 


; R s w - n 
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Usually, k m is a machine constant and = 0.6072 (for m — l) or 1-00 (for m =* 0) or 1.205 (for m * — l) t when n > If 
[12, IS]. The normalised rotation matrix of (6.b) indicates that each small rotation can be realised with one simple 
shift-and-add operation. Hence, the computation of a trigonometric function can be accomplished with n shift-and-add 
operations, which is comparable to conventional multiplications. This makes a CORDIC ALU a very appealing alterna- 
tive to the traditional ALU for implementing the elementary functions. In general, the normalised CORDIC algorithm 
can be written as follows: 

FOR t= 0,1, • • • , n-l,DO 




1 -m «, 2-< 1 > 
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(7.b) 


where z$ = xq, y% = y^ m determines the type of rotation, d, is chosen as in (4), and the auxiliary variable if* is intro- 
duced to accumulate the rotation after each iteration. And the corresponding “unnormalised” CORDIC algorithm is 
described as: 

FOR i= 0,1, • , n— 1, DO 




1 -m u, 2~* l '> 
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= P. 

«, 2-<‘> 1 




*+i = + *,4 (8J>) 

where x 0 - x 0 and y„ — y ir It can be shown that x, and z, v will accumulate the angle of the total rotation and have the 
same value after n iterations. However, the end results of (x,,*.) from the iterations of ( 8 .a) and the end results of 
(x^y*) from the iterations of (7.a) are related according to 

*. “ *m ; ». = 9m (*) 

Consequently, one may evaluate z; v and y* by using only tbe shift-and-add operations in ( 8 .a), then realise z. and y, by 
other simple methods such as ROM look-up tables and regular combinatorial logic, etc. Fortunately, it is possible to 
find a simple way to normalise tbe scale factor k* using the same shift-and-add hardware [14, 15]. The supplementary 
operations that are used to force the scale factor km to converge toward unity can be either performed after all the 
operations of (7.a) are terminated, that is, 

- (1 + 7 , 2 -) x* ; y / +1 - (1 + 7 , 2 -‘)rf ( 10 ) 

where x£ = x, v , y 5 = y. v , and 0 < 1 < a— 1 , or interleaved with the operations of (7 .a), that is, 

= (1 + % 2-) x? ; y? - (1 + 7. 2 -) y," («) 

where 0 < » < n-1. The parameter 7 ,' in (10) or (11) may be -1 or 0 or 1 depending on the value of » and the type of 

rotations (i.e. m) [14, 15]. 

Haviland et ai. [14] realised the CORDIC algorithm on a CMOS chip and showed that the processing time of the 
CORDIC chip is 40 /is. They also suggested n = 13 as the minimum cycle time of a two-byte (24-bit) fixed point opera- 
tion. However, in practice, they used n = 24. For a conventional CORDIC module, it requires 5 shift-and-add modules 
to compute one CORDIC iteration and one normalisation iteration in parallel (that is, 3 shift-and-add modules for (7_a) 

and (7.b), and 2 shift-and-add modules for (ll)). The desired output can be obtained in 24 iterations (n = 24). Thus, 

24 iterations of 5 shift-and-add modules computing in parallel will be enough to realise CORDIC algorithms. This indi- 
cates that the CORDIC processing time is no slower than the time for a serial multiplier computing two 24-bit operands. 
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Figure 1 muttriMi the dMMtUry fbctio— Uut cu U obtained (m Ut CORDIG ptoc— or vka us k Ml to 
-I* 0, or I. to this Kftrt.a CORDIC pr oces s or is depicted u&ba with tkm inputs ap » f* > *» whkh art tto toifttol 
▼alee* of s, aid x, to (•), u ««Q as Urn outputs tkat cormpoad to Um ftu) values of ^ t | a , ud a. to (t). Tto, 
Um outputs a. , f. , a. are the dtand ekaeitaij functions, vk«« m is appropriately set to -1, 0, «r I. These CORDIC 
processors will be coaf gated aad coutcUd, based os a functional dtcoapoatios of tbs jotot nagk eftltow, to atvisa 
at am eflkieat arckitectars for tbs computation of iavtm kinematic position solution. 

2.2. CORDIC Arcbitocturo for bvtrst Kinematic Position Computation 

Tbs desaga philosophy is to examine tbs inverse ktosmatk posit ioa solatioa for its computational flow aad data 
depeadeacies ia order to functionally decompose tbe compatatioas iato a set of CORDIC computational modules (OCMb) 
with aa objective that each COM viD be reaSaable by a CORDIC processor* This functional dacsmposition is nut 
aa sfa r aad can be best represented by a directed task graph wiik a laite set of nodes denoting Odds and a rnrtispmd 
tog laite set of communication edges denoting operands movements. An examination of tbs inverse ktosmatk position 
equations to Appendix A shows a limited amount of parallelism with a large amount of wqurntisimi to tbo flow of com- 
putations aad data dependencies. This serial nature of tbs computational flow lends itself to a pipelined CORDIC pro- 
cessors implementation [IS]. The decomposition of tbe inverse kinematic position equations is duns by looking at Um 
equations which can be computed aa elementary functions by CORDIC processors as Gated to Figure 1. 

The task of computing the inverse kinematic position of a PUMA robot arm, based on tbe equations to Appendix 
A, can be decomposed into 25 snbtasks to which each subtask corresponds to a COM aad can be realised by a CORDIC 
processor. For example, # x can be computed by the following 3 sub tasks: 


Sub task 1: *1. - r — (p * + 



CORDIC Processor: CIRC 2 


t 

*0“P. 
»o - ~r, 
*0 “ ® 


Sabtuk 2: *2. - V'* - <1 - V*»2 “ *1 


CORDIC ProceMor: 


HYPE 2 


f 

*o-0 


Subtask 3: 


* 3 . 7 *i - * 1 . - 


CORDIC ProctMor: CIRC 2 


* 8 - * 2 . 
Il — 
* 9 - 11 , 


The computational flow of these 25 tasks together with the input data can be represented by tbe directed acyclic data 
dependency graph (ADDG) with switching nodes aad parallel edges as shown to Figure 2 and tbe details about tbe 
decomposition of the inverse kinematic position solution into CCMs can be found to [4]. In Figure 2, each computational 
node, indicated by a circle, represents a CORDIC computational module, and each switching node, indicated by a dot, 
perforins no computations but just switches data to various CCMs. The operands or data move along the edges. A 
major bottleneck in achieving maximum throughput or maximum pipelining to Figure 2 ia the different arrival time of 
the input data at the multi-input CCMs (e.g. nodes T18 and T22 ia Figure 2). The computatioas of multi-input CCMs 
can not be initiated until all Um input data have arrived. This different arrival time of input data leagtbens the pipe- 
lined time. Thus, the ADDG is said to be unbalanced aad fails to achieve maximum pipelining. Several teckaiques [l7j- 
[19] have been suggested to remedy this data arrival problem by imserttog appropriate a umber of buffers (or delays) to 
some of the paths from the input node z to the multi-input CCMs to “balance” the ADDG and achieve maximum pipe- 
lining. This buffer assignment problem for balancing the ADDG can be reduced to aa integer linear optimisation prob- 
lem. Detailed formulation of the optimal buffer assignment problem as an integer linear optimisation problem can bu 
found in [4]. After solving the buffer assignment problem, realisation of the balanced ADDG results to a maximum pipe- 
lined CORDIC architecture. For a PUMA robot arm, the architecture consists of 25 CORDIC processors and 141 buffur 
stages with 4 tapped- delay- line- buffers [4). The initial time delay of the pipeline is equal to 18 stage latency (or 
720 ps)» where the stage latency of a CORDIC processor is assumed to be 40 p* [14]. The pipelined time of tbe 
CORDIC architecture equals to one stage latency or 40 pa. The realisation of the maximum pipelined CORDIC archi- 
tecture is shown in Figure 3. 
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S. Invtm Py uimWi Coa p ^ Utt o a 

Tk gftml iamtt ijiaak problem lor u «-Bak ouipvUtor cam bo stated mo follow: itmlkjimt positfe— 
and velocities , f,(t) ) "_| which describe the state of tbe nuiptbtor mi time f , together with tbe jsint accelera- 

tions (£(«)}•», which are desired mi the poiat, solve the djumk equtiort of motion for the joimt toryii {»>(!)}*-* M 
follows: 

(u) 

wfc*r« f(l) - Jr„ * fc ...,r.) r , q(t) - , q(t) - (f „ ,*.) r , 3<) “ ft’i. f»-~4.| r . H ) * ** •*! 

vector function mmd »« prescript T demotes transpose operatioe om matrices mmd vectors. 

At preseat, mmch atteatiom has bee a hewed om the coespetatiomaJ issees of the inverse dynamics based om the 
Newton-Euler (NE) formulation, resulting ia various meltiprocessor-hased coatrol syslenw |t ,21-24]. The le c ur sr re strmc* 
tare of the NE equations of motion is obviously well smited to standard mngk-instr met ion-stream and single-da ta-stream 
(SISD) computers. It is, however, act aa eficieat parallel proressiag for acw aigh instruction-str ea m mmd muktipk- 
d a ta-stream (SD4D) com paters that are capable of performimg many simultaneous operations. Oar app r oa c h in design- 
tag eficieat algorithms for compatiag the robot ia verse dynamics is to look at the compete tional complexity of the 
problem Irst. Ia partkelar, we aeed to kaow what is the Kmitatioa of speediag ap the computation ef the ia verse 
dynamics while fanning oa p processors, where l < p < n. That is, we would like to establish a time lower boend for 
the inverse dynamics comp station problem so that several eficieat compatational schemes can he compared and con- 
trasted. Them eficieat algorithms achieving the time lower boaad caa be designed for the competation of the inverst 
dynamics. The following notations and lemma will be ased to derive the time lower boaad of the inverse dynamic prob- 
lem. 

Notations: 

(1) Lower ar ithme tic expression is any well-formed string composed of foar arithmetic operators (+,— ,X,/) or, for 

convenience, two operators +(or -), X (or /), left and right parentheses, and atoms, which ars constants or vari- 
ables. We denote a linear arithmetic expression E of m distinct atoms by E<m>, e.g. E<4> :•+!-« / d. 

(2) T, |/,(*) , /j(*) , ... , /.(•)) » Miiimim competing time needed to evaluate [ /,(•) , /,(•) , ... , /,(•) ) 

using p processors. 

Lemma 1: The time lower bound of T p (f?<m>] [25j. The shortest parallel time to evalaate a linear arithmetic 
expression E<m> using p processors is bounded below by and equal to 0{\m/p] + flog 2 p]), that is, 

T , (E<m> ] > 0\\m/p} + [log 3 pi) 

Theorem 1: The shortest parallel time to evaluate the joiat torques { r ; (0) ^ equation (12) using p processors 

is bounded below by 0(k t fa/p] + k^ flog a p]), where k x and k 7 are specified constants, that is, 

r,[r„rj,...,r,] > Ofk, fi»/^l + **flo*jpl) (IJ) 

The proof of Theorem 1 can be found in [10]. Two ext** me cases follow from Theorem 1: 

(a) If p « l, then the shortest computing time T f [?!. ?*...,?»] is not lower than 0(a). Thus, the NE formulation is the 
most efficient algorithm of evaluating the inverse dynamics running ia uniprocessor computers. 

(b) If p = », then the shortest parallel computing time T p [fit 1 *— » r «] is not lower than Offlog^]). 

Theorem 1 indicates that an efficient algorithm running on p processors may not achieve the same time order as 
0(k, [*/p] + k,flog 3 p]). However, if a parallel algorithm possesses the time lower bound, then it mast be the most 
efficient algorithm of evaluating the inverse dynamics. Theorem 1 also indicates that, although NB formulation is very 
efficient for computing the inverse dynamics, a better solution is to find an efficient parallel algorithm, ramming on p pro- 
cessors, that possemes a time order of 0(k x [n/p] + k* [log* p]). A parallel algorithm ruaning on an SIMD machine and 
achieving the time lower bound is discussed next. 

The recursive NE equations of motion are very efficient in evaluating the inverse dynamics whether they are for- 
mulated in the base coordinate frame [6] or the link coordinate frames [7]. The clear advantage of referencing both the 
dynamics and kinematics to the link coordinates is to obviate a great deal of coordinate transformations and to allow 
the link inertia tensor to be fixed In each link coordinate frame, which results in a much faster computation ia a unipro- 
cessor computer. However, the recursive structure of this formulation is in an in homogeneous linear recursive form, e.g~ 
w, * a, w,_, + i lf where a, * (n 3x3 rotation matrix) and i, ~ *oq,t» which requires more calculations and 

arrangements for parallel processing than the homogeneous linear recursive form. On the other hand, the NE formula- 
tion im the base coordinates caa be re-arranged and transformed into a homogeneous linear recurrence form, e-g. 
w, * + f t a,—,, which is more suitable for parallel processing on aa SIMD computer, yielding a much shorter com- 

puting time. 

Once the NB equations of motion are formulated in the base coordinates in a homogeneous linear recurrence form, 
then a parallel algorithm, called recursive doubling [16,1 7,24-27), can be utilised to compute the kinematics in the for- 
ward equations and the dynamics (or torques) in the backward equations [5]. The homogeneous linear recurrence prob- 
lem of site (*+l) can be described as follows: given x(0) - «(0) * identity and «(*), 1 < » < a, fiad x(l)* x(2), ... ,*(*) 
by an algorithm running on an SIMD computer of n processors, where 

t n, u4 1; uf im 4 k«N m vktiabM. 
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«(.) - *{.-!) * .(0 ( M ) 

m0m Imtot u tMociaUft eytn tof, vkkJi eu be a malm pm 4a ct or idfitlw, a rector dot p roduct or s ddili m, 
•tc. If «(0) ii tyil to aa ideatii y, then the linear recurrence of sis* (*+l) caa be redact to tbe case of om * by lab 
ia| a itiA operation as follows: «(0)«-a(l) t •(•-!)♦-•(»)» where V" denotes a replacing opera 

tioa. Tie r e cu r sive doubling algorithm bask ally solves tbe homogeneous Sam recurrence fora ia (14) by ipfiUiat tbe 
computations of a serial associative opera tioa. For aa arbitrary *(a)^ tbe generalisation of tbe iba aloes (a-fiyi 

parallel operations at tbe &nt splitting, (a + 1 )/4 at tbe ac o t d, and (a+lj/l 1 at tbe k tb mattl [bfjail)] spSUb p L 

tbea a (a) ia csaysUd with oac laal operation. Similarly, it caa be mo w n tkat aft) caa be rnapilid ia [lo| 9 (i‘fl| 
ipfittbp, where 1 < t < a. la other words, with tVe recursive doobfing algorithm, x(l), *(2), ... , x(a) caa be comp ete d 
concurrently ao later tbaa tbe time step fkgjs + l)]. 

Tbe proc e d ure ia cosapotiag the inverse dynamics from tbe NB epit ioe s of motion formalated ia tbe base coe ad t 
mates ia to re-arrange tbe kiaematk epalioM atd tbe dyitaic agnations ia a bomo gtat oa a linear a cswl t t Iona, aad 
tbe following mpot paraaaeters relating tbe fbnunlatioa moat be given or evaluated in advance: 

(a) Tbe 3x3 rotation matrices '**0,, t » 1,2,...,*, mrbkb abates tbe orientation of Salt s' coordinates ref erence d In 
fink (a — l) coordinates, need to be evaluated ia advance. 

(b) *p* denotes tke origin of Bak t coordinate frame from tbe origin of Eak (a— 1) coordinate frame ex p res s e d wilb 
respect to Bak t coordinates; f s, denotes tke location of tbe center of mam of Bak « from tke origin of Bak t coor- 
dinate fraac expressed with respect to link i coordinates; aad * J, denotes tbe inertia matrix of Bak t a boot its 
center of mam expressed witk respect to Eak s’ coordinates, mast be given ia advance. Note that *p, , *a,-, aad *J,- 
are coastaats wbea referred to tbeir own link coordinates. 

(c) X, is a joint indicator wkick specifies link s’ is rotational or traariatioaal as follow: 

| 0 , if Eak s’ is rotational 
* \ 1 , if link s’ is translational 

(d) Let — 0 f pQ « \§ a9 p yv f g ] T aad [g| m IJQIllsi/s 1 . If external force f, and external smombI d, are 

exerting on Bak a, tbea f i+ , » f„ n t ,| * n 9 ; otherwise, f t , ( * n^t — 0. 

Tbe procedsre of evaluating tbe NB equations of motion as a linear recar re nee problem is then given below (note 1, ace 
need here as variables): 

STEP 1. Compete tbe rotation matrix °R, with respect to the base coordinates for i * 

•R, (») 


STEP 3. Compete p,* , e t> and s, for s’ — 1,2, ,a 

• *0 . *0 * (o«° 4 l r ; Pi'* °S, ‘Pi* ; - °R, ■«, 

Tbe evaluation of s, only involves taking tbe third column of °R,. 

STEP 3. Compete 

*. - (i - M 

and 

- *i-i + *• 

STEP 4. Compete 

“ (*•— I fi I X l|_i fj) (l “ ) 

and 

STEP S. Compete 

», - «. X p,* + *#, X (w, X p/) + (•<_, f { + * •»,- X (•,_» *, )) \ 

and 

?.• - p.-i + *. 

STEP fi. Compete 

(«, X a,-) + Pi 

STEP 7. CompaU 

r, - •,?, 

STEP t. CompaU 


d») 


to) 

(«) 

d*) 

W 

w 

(a) 

ta) 


N< - J, «. + w, X (J. *»,) 
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w 

t») 



tkmt tkc 


For tbt nfa of nviac tW cakilitioM of oralutiag J 4 » X *J 4 *R» wkick Lmk «t aL (7] ik i w f 
cmjmUUm m *ut« coaopBcotcJ, (») ■ bmUSmI to 

'•». - ‘*o •». - (**,) r •».- J {•K,) r 


STEP 1 Cwpite . 
STEP 18. CoaptU 


‘N. - *J 4 + •«, X f J 4 '«.) ; N 4 - •*, -N, 


fi - f 4+I + F* 


f 4 « n 4 + (p 4 + %) X f 4 + p 4 X fi>i 


(») 

(«) 

(») 

(») 


STEP 11. CoafiU 


■i - “•>« + *. 




( n i) r »i-i » if\-0 


(30) 


(31) 


Pnrioiilj vukiftd Um» l expressed la tkc beet coordinates, arc jpvo as follows: n\ is the mass of Ink * f «,• Is tkc 
angular Telocity of link «, w, is tkc asgilar acceleration of fink t, p, is tkc linear acceleration of link t, r, is the linear 
acceleration of tke center of mass of link i, F,- is tke total force exerted on link i at tke center of mass, N,* is tke total 
moment exerted on fink t at tke center of mass, I* is tke force exerted on fink i by fink t— 1, n,- is tke moment exerted 
on link i by fink »-l, t g is tke torque exerted by tke actnator at joint i if rotational, force if translational, f t * is tke 
joint Tariable of joint i (#,• if rotational and if if translational). 

Bqnation (1$) skows tkat tke eTalnation of °R, is a simple recursive matrix product form. Equations (18), (20), 
(22), (22), and (30) only inrolre simple recursive rector addition form. Tke other eqnations in tke NE eqnations can be 
competed parallelly. Thus, tke eralnation of tke total coropatational complexity of tke parallel algorithm for n PUMA 
robot arm can be derived as follows: 

(a) Tke parallel eralnation of (1$) using reenrrire doubling indicates (27 [log}*] — 19) scalar maltiplkations and 
(18 riofan] — 14) scalar additions. 

(b) Eqnations (18), (20), (22), (28), and (30) all bare tke same recursive rector addition form, the total parallel evalua- 
tion of these eqnations requires (6 + 9 flog^n+l)]) scalar additions. 

(c) Tke parallel eralnation of tke other eqnations in tke NE formnlation, e-g. °R, ' p,-* f s t - = °R, F,*, N,* t r i9 

and all tke h g of (17), (19), (21), and (29) can be calculated by simple parallel computations, yielding a constant 
computation of 135 scalar multiplications and 98 scalar additions. 

Combining tke results of (a), (b), and (c), tke total computational complexity of tke parallel algorithm applied to a 
PUMA robot arm is (27 flog}*] + 116) scalar multiplications and (24 [logjrt] + 9 flog^n + l)] + #4) scalar additions. 
Note tkat it is of time order Offfog}*]) because we are using p =* n processors. If further reduction on tke coefficients 
of ( flog2«1) ■ desirable, this can be accomplished by using matrix multiplier chips. This would reduce tke coefficients 
27 and 18 in evaluating (15) as discussed in (a). If m ■* 6, then the complexity of the parallel NE algoritkm is 197 multi- 
plies (malts) and 183 additions (adds) as compared with the complexity of the NE algoritkm running on a uniprocessor 
[7]: 852 mults and 738 adds. Moreover, even if • becomes large, say n — 12 (for redundant robots), then tke number of 
mnltiplications and additions increases only by 27 and 33, respectively. Thus, we have shown that considerable savings 
in computation time can be achieved from embedding the inverse dynamic computation in a parallel algorithm, which 
has a time complexity of logarithmic in the number of joints, 0( [log} *])- 

td. An Efficient Parallel Algorithm With p-Fold Parallelism 

Last section showed that the bottleneck of parallel computation of the inverse dynamics depends on solving the 
homogeneous finest recurrence of tke N-E formulation. If tke restriction tkat one microprocessor “handles” one joint is 
relaxed, it is desirable to obtain an efficient parallel algorithm which can greatly improve the evaluation of the linear 
recur re ace using p processors. A parallel algoritkm of evaluating tke inverse dynamics with a restricted number of p 
processors has been developed to achieve the time lower bound of 0(k t [*/pl + k* [log} pi). The proposed p-fold paral- 
lel algorithm can be best described as consisting of p-paraDel blocks with pipelined elements within each parallel block. 
Tke results from the computations in the p blocks form a new homogeneous linear recurrence of sire p , whi ch again can 
be computed using the recursive doubling algorithm. Tke parallel algoritkm with p-fold parallelism (PFP) is summar- 
ised and presented as follows: 

Algorithm PFP (p-fold Parallelism). This algoritkm divides tke computations into p -parallel blocks of computations. 
The j th processor computes the elements ia the / th block serially. Tke results from the the p -parallel blocks form a 
new homogeneous linear recurrence of sue p, which can be computed by the recursive doubling algorithm. 
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Pi. [hitUKtttioaj Givta «(0) + ideality, kt<« [(n+l)/y), m ~ ye, mad eel 

( 

j «(i-l) , 1 < » < u+i 


M0- 


Jo , u+1 < t < m 

wktw e indicates ike block da (member of elemeate ia a block). 


PS. [Diride iato y blocks^ Divide M*)» 1 < » < at, imlo p blocks me follows; 

ytb block - (M(y-1> + 1). M(i-*)« + 2), amd let N(l J) - M((j-1). + 1) 

where N[iJ) iadksteo tbe t th element ia the jib block. 


*</<* 


PS* [Compete N[iJ) ia a DO loopj Tbe /th processor serially competes N(»j) ia tbe /th block as: 
For • » 2 step 1 matil e Do 


w 


N[ij) - Nii-ij ) • M(y-i) «+o . i<y<p m 

END 

It is seem that s(l), r(l), ... , x(e— 1) has bees evaluated ia the DO loop as well as M*4l 2 < » < s. That k 
*(l) - N[ 24). *(2) - N( 34). ~ , *(«-!) - ^4 

Pd. (Form a aew homo g eneous linear recurrence of sise y.) Let Tf;) — N(#J) and y(j) » *(>-1), for 1 < / < p and 
referring to (14), (32), aad (33), we have 

* W * *0-1) - *(l/-i> - 1) * «((i-0) * •((/- 1> + 1) * ... * •(>-!) (34) 


- y(/-l) • MO-O + 1) * M(/»l> + 2) * ... * Af{» - y (j-1) * N(*,/) 

Equation (34) is a aew homogeneous linear recurrence of rise y which can be parallel}/ evaluated by tbe algorithm 
FOHRA, running in time proportional to 0( [log^ yl) and yielding tbe resulU x(«-l) « y(l), x(2s- 1) ** y(2), 
*(y«“l) - I (y)* (Note that if (*+l) is divisible by y, then x(n) - y(y); otherwise, *(ys-l) * y(p) » 0). 

P5. [Compete intermediate s(i) in equation (35).] Without loss of generality, assuming that (*+l) is not divisible by y, 
then there are intermediate terms of z(») that need to be determined. They are: 

*0+0 » C < » < s— 2, 1 < / < y-2 and *((y— 0+0 t 0 < » < n-(p— l)s (25) 

Referring to (7), (33), (34), and (35), thereby giving 

*0+0 - *0-1) * «0) * *0+0 # ... * *0+0 - y(j) # M>+0 * M>+0 0 ... * M>+*+i) (*) 

- y(i) 0 N(t+i j) , o < i < s-2/ < / < p - 2 


and *((y- !)• + 0 — t(P“l) * M*+ 1#— 1) . 0 < i < u-(y-l)s 

where //(i+lj), y(jf) of (34) have been evaluated in steps PS and P4, respectively. Equation (34) shows that if 
(w— y— f+3) ta sk s are of tbe same evaluation, then the calculations of these equal tasks are suited to an SIMD com- 
puter of y processors. It is shown that parallel evaluation of (34) requires r(n-y-s+3)/yl time steps (note that ? 
(»+l) is divisible by y, then x(n) can be evaluated in step P4, yielding (a-y-s+2) equal tasks in (34), thereby 
requiring [(u-y— s+2)/y] time steps). 

END PFP 

U b seen that tbe total parallel computing time T p of the homogeneous linear recurrence of rise (s+l) usi n g p 
processors is: 

f rO+l)/>l+ r 0— p— *+3)/yl+ Tlosa Pi - 1» if (*+l) in not divisible by y. 

Tp " lf(*+I)/jl+f(*-y-«+2)/yl+ Tioga y] - 1, if (*+l)is divisible by y. 

Applying the above y.fold parallel algorithm to the N-B formulation for an *.fink rotary manipulator, it is able to 
achieve the time lower bound of 0{k x [+/p\ + k^flog, yl). 

The above n-fold parallel algorithm is suited to he run on an SIMD computer. A cascade structure can he umd 
for connecting the PEs. An alternate structure is to position a network between the processors and memories. The 
interconnection pattern, called the “perfect shuffle (24, 27],“ has the number of links between processors proportional to 
*. An attractive interconnection pattern, called “inverse perfect shuffle (26*27]," is suitable for tbe implementation ef 
solving the homogeneous linear recurrence aad can be obtained by reversing the arrows of tbe perfect shuffle. Detail 
about this network connection for solving tbe homogeneous linear recurrence for computing the joint torques can be 
found in [10]. 


4. Conclusion 

A maximum pipelined CORDIC architecture for computing the inverse kinematic position solution and an efficient 
parallel algorithm for computing the joint torques have been discussed. To achieve maximum throughput, delay btim 
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an to baluct tW pipeline. For a PUMA robot aim, tk CORDIC arddUctaic conmta o f 2$ CORDIC proces- 

■on aoi 141 Mtr stages with 4 tappa^elay-fiM kina. Tk initial tint delay of the pipeline is cqaal to 720 p* aad 
Iks fipdM tins of Iks CORDIC aicUttctan ego a It to oaa stage lateacy or 40 ps. The paialkl alpnUs lor coo 
pvtiat Iks Joint t o r q ots ku & time c omp l exi t y of io|ariUune in the amber of joints. The interconnection networks for 
tko p roces so rs ban aln boon investigated to improve tko itSHuUwi of commeakation ud internal btltnif betwcei 
prow ora in an SD4D compnter. Ubkg tko concopts of tko p ropo s ed parallel algorithm, it woald bo powilh to device a 
VLSI chip capable of implementing the inv ers e dyaanks compoUtion at speed priaianly boondod by tko pro p on e d 


b. Appmdb At bmm Kinematic Position Rotation 

Tbo hnm kinematic position probity can be stated ac Gina the poeitioa/oricatatioa of the maaipolator hand 
and the Hak/jomt pare meters, determine tbs joint angles so that the maaipolator can bo positioned so dobed. That is, 
g i r en 

■* n «. L 
_ *S S *1 7| 

*» •, «, 9* 

0 0 0 1 


n o a p 
0 0 0 1 


tbo joiat angle equations are: 

r-W + ftY* 


Un^l 


K 

- tan- 1 

** 

l* J 


±Vr*-di 


/n* " PmC 1 + f 9 S 1 ; film ** •*^1 + *>^1 > /«f* — F*^I + > f 13# - — **^1 + O f C| 

* * /n, + /nr - " •* “ •* i « - 


tan 


-1^1. 

^4 


— tan" 


±Ve -d* 

*r(/ — PjCJ + d 4 /ny — 


•a(/nr^a + P*^a) + *3/11* + rf aP* 


» — tan' 


-1 


(P*^3 “ /n^j) + (— ~)/iir + (~)P* 

4 J *3 


(p,Cj + /u,Sj) + (~)/nr + (~)p* 

•a *i 


P* ~ P» ~ fa 

— tan 


-S,a, + C l a y 


CAC A + *iS) " $»•» 


tan* 1 


/ 13# 


*?»/ 1U “ ^W*» 


tan“* 


gaK^Ci*, + V,) - + <?isl 


^ut^io, + 5,0,) + C„s, 


-tan“ l 


Ca(C 0 / He " ^33*t) + ^a/ 134 


^»/lU + ^23®* 


tan 


-CJgfe/n, ~ Sa».) 1 SJ iU ) ± *»($«/„. + C^o,) } 


SJPnf n» - ^s»i) + ^«/u# 


when (— — ) , (— ) , (— ) an constants, C f * con 5 f w sin C $ bcm (8 { + #,•), and Sy a sin (*,. + #,-). 

ft. R*f< 


(A*l) 

(A-2) 

(A-3) 

(A*d) 

(A-5) 

(A-«) 

(A-7) 

(A-8) 

(A-9) 

(A-10) 
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Figure 2 T**k Graph for A PUMA Invent Kinematic Poeltlon Solution 




































